Scalable K-Means++

Authors

Bahman Bahmani

Benjamin Moseley

Andrea Vattani

Ravi Kumar

Sergei Vassilvitskii

Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
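The contrast drawn in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. `kmeanspp_seed` shows the standard sequential seeding, which needs one pass over the data per chosen center. `kmeans_parallel_seed` shows one plausible reading of the oversampling idea: a few rounds in which every point is sampled independently, followed by reducing the weighted candidates to k centers. The function names, the oversampling factor `oversample` (defaulting to 2k here), and the fixed `rounds=5` are assumptions made for illustration; the abstract does not specify them.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def kmeanspp_seed(points, k, rng, weights=None):
    """Sequential k-means++ seeding: one pass over the data per center."""
    if weights is None:
        weights = [1.0] * len(points)
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        # Sample proportionally to weight * squared distance to nearest center.
        d2 = [w * nearest_sq_dist(p, centers) for p, w in zip(points, weights)]
        if sum(d2) == 0:
            break  # every remaining point coincides with a chosen center
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans_parallel_seed(points, k, rng, oversample=None, rounds=5):
    """Oversampling sketch: a few passes, each sampling many points at once.

    `oversample` and `rounds` are illustrative assumptions, not values
    taken from the abstract.
    """
    ell = oversample if oversample is not None else 2 * k
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        d2 = [nearest_sq_dist(p, candidates) for p in points]
        cost = sum(d2)
        if cost == 0:
            break
        # Each point is kept independently of the others, so this step
        # could run in parallel over partitions of the data.
        candidates.extend(
            p for p, d in zip(points, d2)
            if rng.random() < min(1.0, ell * d / cost)
        )
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    # Reduce the small weighted candidate set to k centers sequentially.
    return kmeanspp_seed(candidates, k, rng, weights)
```

In this sketch the number of passes over the full data set is `rounds + 1` regardless of k, whereas `kmeanspp_seed` applied directly to the data needs k passes. The final reduction step only touches the candidate set, which is typically far smaller than the input.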


Similar resources

Scalable Kernel k-Means via Centroid Approximation

Although kernel k-means is central for clustering complex data such as images and texts by implicit feature space embedding, its practicality is limited by the quadratic computational complexity. In this paper, we present a novel technique based on scalable centroid approximation that accelerates kernel k-means down to a sub-quadratic complexity. We prove near-optimality of our algorithm, and e...


Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastruc...


Communication Challenges in Cloud K-means

This paper studies how parallel machine learning algorithms can be implemented on top of Microsoft Windows Azure cloud computing platform. More specifically, we design efficient storage based communication mechanisms that lead to a scalable implementation of the K-means.


Fast, single-pass K-means algorithms

We discuss the issue of how well K-means scales to large databases. We evaluate the performance of our implementation of a scalable variant of K-means, from Bradley, Fayyad and Reina (1998b), that uses several, fairly complicated, types of compression to fit points into a fixed size buffer, which is then used for the clustering. The running time of the algorithm and the quality of the resulting clus...


Wasserstein k-means++ for Cloud Regime Histogram Clustering

Much work has sought to discern the different types of cloud regimes, typically via Euclidean k-means clustering of histograms. However, these methods ignore the underlying similarity structure of cloud types. Wasserstein k-means clustering is a promising candidate for utilizing this structure during clustering, but existing algorithms do not scale well and lack the quality guarantees of the Eu...


Scalable Embeddings for Kernel Clustering on MapReduce

There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, t...



Journal: PVLDB

Volume: 5

Publication date: 2012
